4 research outputs found

AsterixDB: A Scalable, Open Source BDMS

Author: Alsubaiee Sattam
Altowim Yasser
Altwaijry Hotham
Behm Alexander
Borkar Vinayak
Bu Yingyi
Carey Michael
Cetindil Inci
Cheelangi Madhusudan
Faraaz Khurram
Gabrielova Eugenia
Grover Raman
Heilbron Zachary
Kim Young-Seok
Li Chen
Li Guangqiang
Ok Ji Mahn
Onose Nicola
Pirzadeh Pouria
Tsotras Vassilis
Vernica Rares
Wen Jian
Westmann Till
Publication date: 02/07/2014

AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well-suited to applications like web data warehousing, social data storage and analysis, and other use cases related to Big Data. AsterixDB has a flexible NoSQL style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to a mid-2013 initial open source release. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs when compared to alternative technologies, including a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform, for tasks that both systems in each comparison can perform. The paper closes with a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.

arXiv.org e-Print Archive

CiteSeerX

Analysis-Aware Approach To Entity Resolution

Author: Altwaijry Hotham
Publication venue: eScholarship, University of California
Publication date: 01/01/2015

In the era of big data, in addition to large local repositories and data warehouses, today’s enterprises have access to a very large number of diverse data sources, including web data repositories, continuously generated sensory data, social media posts, clickstream data from web portals, audio/video data capture, and so on. As a result, there is an increasing demand from modern applications for executing up-to-the-minute analysis tasks on top of these dynamic and/or heterogeneous data sources. Such new requirements have created challenging new problems for traditional entity resolution techniques, and for data cleaning techniques in general. In this thesis, we respond to some of these challenges by developing an analysis-aware approach to entity resolution. First, we explore the problem of analysis-aware data cleaning in the context of selection queries. Specifically, we propose an “on-the-fly” data cleaning framework in the context of SQL-like selection queries. The objective of this framework is to perform the minimal number of cleaning steps that are required to answer a user query correctly. Our approach leverages the concept of vestigiality to reduce cleaning overhead. We conducted a comprehensive empirical evaluation of the proposed solution to demonstrate its significant advantage in terms of efficiency over the traditional techniques for the given problem settings. Subsequently, we study analysis-aware data cleaning for the more general case where queries can be complex SQL-style selections and joins. In particular, we develop a framework for integrating entity resolution techniques with query processing. The aim of this framework is to utilize the query semantics to reap the benefits of early predicate evaluation while still minimizing redundant computation in the form of data cleaning.
This framework relies on the notion of polymorphic operators, which are analogous to the common relational algebra operators with one exception: they know how to test the query predicates on the dirty data prior to cleaning it. We conducted extensive experiments to evaluate the effectiveness of our approach on real and synthetic datasets. Overall, our experiments show that our analysis-aware approaches significantly outperform traditional ER techniques, especially when the query is very selective.

Ezid

eScholarship - University of California

QuERy: A Framework for Integrating Entity Resolution with Query Processing

Author: Dmitri V Kalashnikov
Hotham Altwaijry
Sharad Mehrotra
Publication date: 06/03/2020

This paper explores an analysis-aware data cleaning architecture for a large class of SPJ SQL queries. In particular, we propose QuERy, a novel framework for integrating entity resolution (ER) with query processing. The aim of QuERy is to correctly and efficiently answer complex queries issued on top of dirty data. The comprehensive empirical evaluation of the proposed solution demonstrates its significant advantage in terms of efficiency over the traditional techniques for the given problem settings.

CiteSeerX

QDA: A Query-Driven Approach to Entity Resolution

Author: Dmitri V. Kalashnikov
Hotham Altwaijry
Sharad Mehrotra
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref